Before you begin…

Motivating questions

  1. What assumptions does a GWAS make about a population?

  2. What happens when these assumptions are not met?

Objectives of this module

When you have completed this module, you will know:

  1. How to summarize the assumptions (biological and statistical) that a GWAS makes about the individuals in a study and the structure of the data.

  2. How to identify instances when your data show evidence of not meeting these assumptions.

  3. What tools are available for analyzing complex data that do not meet the basic GWAS assumptions.

  4. How to implement the tools mentioned above using R

Set up

The computational methods section of this module is divided into two sections, according to the type of data set you are bringing into this module. If you skipped the imputation module of this tutorial, or if your data was too big to read into your computer’s memory, read the section labeled “PCA for large/not fully imputed data” for your computational methods. If you completed imputation for your data set, refer to the “PCA for fully-imputed data” section.

Remember: “fully imputed” means that there are no missing values in the data set.

As a start, load the packages we will need for this module:

# Load our packages  (same ones mentioned in the data module)
# two packages are from Bioconductor 
library(snpStats) 
library(SNPRelate)

# all other packages are on CRAN 
library(data.table)
library(dplyr)
library(magrittr)
library(ggplot2)
library(ncvreg)
library(RSpectra)

# register cores for parallel processing 
library(doSNOW)
registerDoSNOW(makeCluster(4))

Population structure

Concept

In a GWAS context, population structure is a kind of relatedness among the individuals represented in the data set. When considering relatedness, it is helpful to define the phrase identical by descent. Two alleles from the same locus are identical by descent if they can be traced back from the same allele in an earlier generation. For more information on this concept, check out this example.

Population structure is defined by the existence of allele frequency differences that characterize sub-populations and is driven by the combined effects of evolutionary processes such as genetic drift, migration, demographic history, and natural selection.

Population structure is a common phenomenon in genetic data. Varying levels of relatedness are almost always present among genetic samples, even in samples of unrelated individuals and seemingly homogeneous populations. For example, European American, Han Chinese, and recently, cohorts within the UK Biobank data have been shown to exhibit patterns of population and geographic structure despite their seemingly similar subjects.

Population structure is broadly categorized based on whether it describes recent or ancient relatedness. Ancient relatedness describes the presence of a common ancestor many generations previously. The presence of distinct ancestry groups with different allele frequencies in a sample is known as population stratification (a.k.a ancestry differences). Recent relatedness describes the sharing of a common ancestor only several generations previously. Pedigree-based methods may be used to explicitly model recent relatedness if familial relationships are known. In the absence of known familial relationships, recent relatedness is referred to as cryptic relatedness. 1

Population stratification

Population stratification in particular has been of great concern in genetic studies due to its potential to lead to spurious associations when population structure is associated with differences in both allele frequency and the trait or disease.

As an example, consider a GWAS to assess genetic variants associated with lung cancer in a sample comprised of subjects from two distinct subpopulations, A and B. Assume the minor allele of SNP X is present with higher frequency in subpopulation A compared to subpopulation B, but has no direct effect on lung cancer. Also suppose these subpopulations are geographically segregated in a such a way that subpopulation A is exposed to good air quality, and subpopulation B to poor air quality, and that air quality has a direct effect on lung cancer. A GWAS of data from these subpopulations would find SNP X to be significantly associated with lung cancer, even though we know it has no effect on lung cancer. If subpopulations A and B were not subject to different air qualities, all else being equal, SNP X would not be found to be associated with the phenotype.

Another apocryphal but illustrative example of how population stratification can lead to confounding in genetic studies is linked below.

Cryptic relatedness

In the context of GWAS data, cryptic relatedness is the phrase used to describe a situation when individuals in a study are more closely related than assumed by the investigators. For example, it could be possible that some of the individuals in a study are first cousins, unbeknownst to the researcher(s). This kind of unknown relatedness can act as a confounding factor in a GWAS analysis. Like population stratification, cryptic relatedness can lead to inflated false positive rates in gene-association studies. The former is usually characterized by having differences among groups of subjects, while the latter is typically characterized by unknown relatedness among individual subjects in the study.

Tools/implementation

GWAS (and many statistical tests) assume samples are independent. Cryptic relatedness and population structure can invalidate these tests since the presence of non-independent samples, and thus non-independent errors, can lead to inflated test statistics.

With this in mind, it is critical to evaluate and account for potential population structure in our data in order to avoid false positives and negatives.

Principal component analysis (PCA)

Brief introduction

One of the simplest and most common methods used to assess and correct for population structure in our data is with principal component analysis (PCA). Here, I will present a ‘crash course’ on the concepts of PCA and provide an example.

Conceptually, PCA can be thought of as extracting the axes of greatest variability in our data. PCA creates linear combinations of a given set of variables; this lets the user represent a lot of correlated variables with a smaller number of uncorrelated variables - so PCA gives us dimension reduction, which is really helpful for high dimensional settings like GWAS data.

Connections to different conceptual frameworks:

  • From a linear modeling framework, PCA is an orthogonal linear transformation of a data set.

  • From a machine learning framework, PCA is a type of unsupervised learning.

As a brief example with a smaller data set, I’m going to illustrate some of these concepts using some smaller data with known structure.

First, we’ll read in the small data and filter out the monomorphic SNPs. Note that this data set contains a known race variable.

# read in smaller data 
smaller_data <- read.delim("https://s3.amazonaws.com/pbreheny-data-sets/admixture.txt")
# see what it looks like 
smaller_data[1:5, 1:5]
#               Race Snp1 Snp2 Snp3 Snp4
# 1 African American    0    0    0    0
# 2 African American    0    0    0    0
# 3 African American    0    0    0    0
# 4 African American    0    0    0    0
# 5 African American    1    0    0    1
# assign the 'race' variable 
race <- smaller_data$Race
 # create a matrix with only SNP data 
all_SNPs <- as.matrix(smaller_data[,-1])
# filter out monomorphic SNPs
polymorphic <- apply(all_SNPs, 2, sd) != 0
SNPs<- all_SNPs[,polymorphic] 

# look at the resulting matrix dimensions
dim(SNPs)
# [1] 197  98
# see what the data set looks like in terms of racial categories 
table(race)
# race
#          African African American         European         Japanese 
#               50               47               50               50

Next, I’m going to compute the principal components (PCs) and plot a scree plot. A scree plot tells us the proportion of variance explained by each of the PCs. PCA is such that the PCs are ordered from those that explain the greatest to least amount of variability. Scree plots are sometimes used to decide how many PCs are appropriate to include. We look for the ‘elbow’ in the plot, which indicates when the proportion of variance explained by including additional PCs may not be worth the extra df. There are also a number of more complex tests and tools to determine this as well, or often the top 10 PCs are simply used in practice.

# use the convenient prcomp() function to do all the PCA steps in one line! 
pca <- prcomp(SNPs, center = TRUE, scale = TRUE)
# look at the results 
pca$x[1:5, 1:5]
#            PC1        PC2        PC3        PC4        PC5
# [1,] 1.6652335 -0.3909691 -0.7799540 -2.3234440 -2.0802964
# [2,] 1.3048975 -2.2997440 -0.4810622  1.9478506  1.6806863
# [3,] 3.2633748 -0.1948454 -0.6283735  2.7211461  0.9575737
# [4,] 0.5371343 -1.5322968  0.6324284 -0.4261897  2.3696399
# [5,] 3.3810463 -1.2142128 -1.8344268 -0.8499357 -1.6181324
# plot the top 10 PCs in a scree plot 
plot(x = 1:10,
     y = 100 * proportions(pca$sdev[1:10]^2),
     type = 'b',
     ylab = 'Proportion of variance explained',
     xlab = 'PC',
     main = 'Example Scree Plot')

We see this elbow point at 3 PCs, which makes sense, given that there are 4 distinct races in this data set (we lose one degree of freedom when we center our data). The first PC explains about \(30 \%\) of the variance in our data, while the second PC explains a little less than \(20 \%\). Together, this means that the first 3 PCs explain over half of the variance in the data set.

Now we’ll plot the first two PCs against each other, and color the data points by the known race of each subject. We expect to see clustering that corresponds to population structure.

pca_dat <- data.frame(race = race, PC1 = pca$x[,1], PC2 = pca$x[, 2])
pca_plot <- ggplot(pca_dat, aes(x = PC1, y = PC2, col = race)) +
  geom_point() +
  coord_fixed()
plot(pca_plot)

Indeed, we see that the first two PCs differentiate the racial groups, which cluster together. We can see that PC1 seems to differentiate the European/Japanese populations from the African/African American ones, while PC2 seems to primarily differentiate the European and Japanese populations. Even if we did not have the known race vector to color the points, we could still pick out some clustering from this plot, which indicates underlying structure. If there were no underlying structure in our data, we would expect to see no clustering or systematic pattern in this plot.

As a counter example to this plot, we can plot the PCs from a random matrix (i.e. a matrix with no population structure):

# for the sake of illustration, I'm going to use the same dimensions as those 
#   in the smaller data in the previous example.
n <- 197
p <- 98
X <- matrix(rnorm(n * p), n, p)
pca_rand <- prcomp(X, center = TRUE, scale = TRUE)
rand_dat <- data.frame(PC1 = pca_rand$x[,1], PC2 = pca_rand$x[, 2])
rand_plot <- ggplot(rand_dat,
                    aes(x = PC1, y = PC2)) +
  geom_point() +
  coord_fixed()
plot(rand_plot)

We see that in this plot, there is underlying structure = no clustering or pattern. If I saw results like this “in the wild”, I would be fairly confident that population structure is not a major issue in my data set.

PCA for fully-imputed data

Now, let’s go back to our running example with the large SNP data set. For our case, let’s call our SNP data \(X\). I can think of \(X\) as a matrix, where I have \(n\) rows (in this case, n = 1401) and \(p\) is the number of columns (in this case, p = 752,675). Since we have \(p >> n\), we cannot simply use the spectral decomposition to calculate the eigenvalues of the covariance matrix \(\mathbf{S} \equiv \tilde{\mathbf{X}} \tilde{\mathbf{X}}^T\), as this \(\mathbf{S}\) may not be positive definite. Instead, we will use a method that exploits the power of the singular value decomposition:

  1. Standardize (center and scale) the data \(\mathbf{X}\), and call the resulting matrix \(\tilde{\mathbf{X}}\).

  2. Calculate \(\tilde{\mathbf{X}} \equiv \mathbf{U}\mathbf{D}\mathbf{V}^T\) using the singular value decomposition (SVD). Here, \(\mathbf{U}\) and \(\mathbf{V}\) are both orthogonal matrices, and \(\mathbf{D}\) is a diagonal matrix of non-negative numbers. We say that \(\mathbf{U}\) and \(\mathbf{V}\) contain the left and right singular vectors, respectively. The values on the diagonal of \(\mathbf{D}\) are the singular values.

  3. We can calculate the principal components using the vectors of \(\mathbf{U}\) and the matrix \(\mathbf{D}\), which is super cool! This is possible because the columns of \(\mathbf{U}\) are eigenvectors of the covariance matrix \(\mathbf{S}\). Specifically, the \(i^{th}\) principal component \(p_i\) can be calculated as \(p_i = \mathbf{u}_i d_i\), where \(\mathbf{u}_i\) is the \(i^{th}\) column of \(\mathbf{U}\) and \(d_i\) is the \(i^{th}\) value on the diagonal of \(\mathbf{D}\).

We will put these steps into practice now - if you have saved the fully imputed data set, load it here:

X <- readRDS(file = "data/fully_imputed_numeric.rds")

Remember, we cannot use the prcomp() function on our SNP data set - it is too big. We will need some other tools:

  • I will use the function std() from the ncvreg package to standardize \(\mathbf{X}\)

  • I will use the svds() function from the RSpectra package do implement the singular value decomposition. RSpectra will let us use a lower-dimension approximation to the SVD, making it feasible to work with this large data set.

  • I will use the sweep() function in base to return an array

Since we are usually only interested in the first few PCs, I will calculate just the first 4 PCs in this example.

# Center and scale the design matrix 
X_scaled <- ncvreg::std(X)
# remove unscaled X object to save memory
rm(X)
# use singular value decomposition to derive the left singular vectors 
# NB: on my laptop, this took about 10 minutes to run 
singular_vectors <- RSpectra::svds(X_scaled,
                         k = 4,
                         nu = 4, # NB: default is nu = k
                         nv = 0)
# check the dimension of the returned "u" matrix and "d" vector 
dim(singular_vectors$u) 
# [1] 1401    4
length(singular_vectors$d)
# [1] 4
# calculate the PCs                          
PCs <- sweep(singular_vectors$u, 2, singular_vectors$d, "*")
dim(PCs) # will have same dimensions as u
# [1] 1401    4
# save the PCs and scaled X matrix
# - will need them for analysis 

saveRDS(X_scaled,
        file = "data/fully_imputed_scaled.rds")

saveRDS(PCs, 
        file = "data/PCs_base.rds")
# find the standard deviations of each PC
PC_sds <- apply(X = PCs,
                MARGIN = 2, # columns are PCs
                FUN = sd)
# create a scree plot
plot(x = 1:ncol(PCs),
     y =  100 * proportions(PC_sds^2),
     type = 'b',
     ylab = 'Proportion of variance explained',
     xlab = 'PC',
     main = 'Scree Plot: SNP data')

Here, we see that the first PC explains over half of the variance in the data set. This indicates that there is probably evidence of population structure in this data set. Based on this, I expect that a plot of the principal components themselves will show strong clustering.

# create a plot of the first two PCs
PC_dat <- as_tibble(PCs) # make a data set (better than a matrix for plotting)
# Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
# Using compatibility `.name_repair`.
# This warning is displayed once every 8 hours.
# Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
names(PC_dat) <- paste0("PC", 1:ncol(PCs))
# plot these principal components
PC_dat %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point() +
  coord_fixed()

We notice in the plot above that there is indeed notable clustering - we have evidence that our SNP data does have population structure. We will need to keep this in mind as we move forward to analysis, as this will guide our choice(s) of tools for analysis and interpretation.

Similar to examining racial group in our example PCA, we can use the clinical features in our SNP data set to see if there is any relationship between these clinical factors and the clustering we see in our PCA.

# load our clinical data 
clinical <- fread('data/GWAStutorial_clinical.csv') %>%
  mutate(across(.cols = c(sex, CAD),
                .fns = as.factor))
# remind ourselves of what clinical/demographic information we have 
head(clinical)
#    FamID CAD sex age  tg hdl ldl
# 1: 10002   1   1  60  NA  NA  NA
# 2: 10004   1   2  50  55  23  75
# 3: 10005   1   1  55 105  37  69
# 4: 10007   1   1  52 314  54 108
# 5: 10008   1   1  58 161  40  94
# 6: 10009   1   1  59 171  46  92
# plot these principal components, using color codes to differentiate sexes
PC_dat %>%
  ggplot(aes(x = PC1, y = PC2,
             col = clinical$sex)) +
  geom_point() +
  coord_fixed() + 
  theme(legend.position = "none")

Based on the plot above, sex does not seem to be associated with the clustering we see in the principal components. In this case, we do not have access to further information on the demographics of the patients represented in the study. However, we can conjecture that unknown factors (e.g. racial/ethnic group, diet, etc.) likely varied among the study population, as authors reported that these data were collected at several sites in the US as well as in Germany (refer to the citation in the quality control module).

PCA for large/not fully imputed data

For really large data (or data that did not need imputation), we’ll be using tools from the package SNPRelate which allow us to do key computations without having a fully imputed SNP data set. To use SNPRelate functions, we need to get our data in a GDS format. SNPRelate has a function snpgdsBED2GDS() to convert PLINK binary data into a GDS file, bed/bim/fam \(\longrightarrow\) GDS, but as far as I am aware, there is no tool to convert SnpMatrix objects to GDS, SnpMatrix \(\longrightarrow\) GDS. So, we need to first convert our qc data back to bed/bim/fam using the snpStats function write.plink(), and then use that to create a GDS file: SnpMatrix \(\longrightarrow\) bed/bim/fam \(\longrightarrow\) GDS.

I’ll re-load the qc-data from the quality control module.

qc_data <- readRDS('data/gwas-qc.rds')
write.plink(
  file.base = "data/qc_data",
  snps = qc_data$genotypes,
  pedigree = qc_data$fam$pedigree,
  id = qc_data$fam$member,
  father = qc_data$fam$father,
  mother = qc_data$fam$mother,
  sex = clinical$sex,
  phenotype = clinical$CAD + 1,
  chromosome = qc_data$map$chromosome,
  genetic.distance = qc_data$map$cM,
  position = qc_data$map$position,
  allele.1 = qc_data$map$allele.1,
  allele.2 = qc_data$map$allele.2
)

If we wanted to spend the time using the mode imputation and rounding to get a qc-data set with absolutely no missingness, we could do bed/bim/fam/ \(\longrightarrow\) SnpMatrix bed/bim/fam/ \(\longrightarrow\) GDS (this is annoying).

We will now create our GDS file, use SNPRelate to compute the PCs, and plot them to see if it looks like there is any kind of sample structure as we saw in the previous example. We don’t have a known race or subpopulation status vector to color this plot, but we can look for clustering.

# create gds file so we can use SNPRelate - using unimputed qc data
qc_files <- lapply(c(bed='bed', bim='bim', fam='fam', gds='gds'), function(x) paste0('./data/qc_data.', x))

# IMHO, this next function is poorly named, but it will serve our purposes here... 
snpgdsBED2GDS(qc_files$bed, qc_files$fam, qc_files$bim, qc_files$gds) 
# Start file conversion from PLINK BED to SNP GDS ...
#     BED file: './data/qc_data.bed'
#         SNP-major mode (Sample X SNP), 252.0M
#     FAM file: './data/qc_data.fam'
#     BIM file: './data/qc_data.bim'
# Mon Aug 15 15:30:23 2022     (store sample id, snp id, position, and chromosome)
#     start writing: 1401 samples, 752675 SNPs ...
# 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 30s
# Mon Aug 15 15:30:54 2022  Done.
# Optimize the access efficiency ...
# Clean up the fragments of GDS file:
#     open the file './data/qc_data.gds' (255.5M)
#     # of fragments: 39
#     save to './data/qc_data.gds.tmp'
#     rename './data/qc_data.gds.tmp' (255.5M, reduced: 252B)
#     # of fragments: 18

# open the GDS file
genofile <- snpgdsOpen(qc_files$gds)

# get PCs
pca <- snpgdsPCA(genofile) 
# Principal Component Analysis (PCA) on genotypes:
# Excluding 0 SNP on non-autosomes
# Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
#     # of samples: 1,401
#     # of SNPs: 752,675
#     using 1 thread
#     # of principal components: 32
# PCA:    the sum of all selected genotypes (0,1,2) = 457958563
# CPU capabilities: Double-Precision SSE2
# Mon Aug 15 15:30:59 2022    (internal increment: 532)
# 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 4.0m
# Mon Aug 15 15:35:02 2022    Begin (eigenvalues and eigenvectors)
# Mon Aug 15 15:35:02 2022    Done.

# close the file
snpgdsClose(genofile)

# plot
plot(1:10, pca$varprop[1:10], type = 'b', ylab = 'Proportion of variance explained', xlab = 'PC')



# put top 10 pcs in a table 
pctab <- data.frame(sample.id = pca$sample.id,
                    pca$eigenvect[, 1:10],
                    stringsAsFactors = FALSE)
names(pctab)[-1] <- paste0('PC', 1:10)

# plot 
pctab %>%
  as_tibble() %>%
  ggplot(aes(x = PC1, y = PC2)) + 
  geom_point() + 
  coord_fixed() + 
  theme(legend.position = "none")

Since the first four PCs explain most of the variance in the data, let’s save these for analysis downstream

saveRDS(pctab %>% dplyr::select(PC1:PC4),
        "data/PCs_SNPRelate.rds")

  1. NB: definitions of the terms recent and ancient are somewhat subjective and hand-wavy since, in theory, if you look back far enough, everyone shares a common ancestor. However, the idea is that humans migrated, separated, and mated such that over time distinct groups developed allele frequencies different enough to confound an analysis.↩︎